Intro: A Student Grade Classification Project

There is a rising trend in using data to determine student performance and provide timely intervention for low-performing students/at-risk students. This project hopes to help highlight and contribute to these efforts in education innovation.

Having previously worked at a classroom learning/management platform edtech startup, I was inspired to do this project as a way to learn more about the process of building a data-driven student intervention system as well as to understand how data science can be used to transform education.

The dataset below was provided by Kaggle, and gives a snapshot of student engagement and student background as well as their corresponding final grades (classified by 3 categories: high-level grades, middle-levelgrades, & low-level grades.

This project aims to determine which factors are the greatest indicators in identifying a student as being at‐risk either behaviorally or academically and to make predictions, based on these factors, of which students belong to which grade class (high, middle or low).

Data Collection & Features

Source of Dataset: Kaggle

The following dataset includes many factors which may influence a student's final grades. Below is a description of each:

1 Gender - student's gender (nominal: 'Male' or 'Female’)

2 Nationality- student's nationality (nominal:’ Kuwait’,’ Lebanon’,’ Egypt’,’ SaudiArabia’,’ USA’,’ Jordan’,’ Venezuela’,’ Iran’,’ Tunis’,’ Morocco’,’ Syria’,’ Palestine’,’ Iraq’,’ Lybia’)

3 Place of birth- student's Place of birth (nominal:’ Kuwait’,’ Lebanon’,’ Egypt’,’ SaudiArabia’,’ USA’,’ Jordan’,’ Venezuela’,’ Iran’,’ Tunis’,’ Morocco’,’ Syria’,’ Palestine’,’ Iraq’,’ Lybia’)

4 Educational Stages- educational level student belongs (nominal: ‘lowerlevel’,’MiddleSchool’,’HighSchool’)

5 Grade Levels- grade student belongs (nominal: ‘G-01’, ‘G-02’, ‘G-03’, ‘G-04’, ‘G-05’, ‘G-06’, ‘G-07’, ‘G-08’, ‘G-09’, ‘G-10’, ‘G-11’, ‘G-12 ‘)

6 Section ID- classroom student belongs (nominal:’A’,’B’,’C’)

7 Topic- course topic (nominal:’ English’,’ Spanish’, ‘French’,’ Arabic’,’ IT’,’ Math’,’ Chemistry’, ‘Biology’, ‘Science’,’ History’,’ Quran’,’ Geology’)

8 Semester- school year semester (nominal:’ First’,’ Second’)

9 Parent responsible for student (nominal:’mom’,’father’)

10 Raised hand- how many times the student raises his/her hand on classroom (numeric:0-100)

11- Visited resources- how many times the student visits a course content(numeric:0-100)

12 Viewing announcements-how many times the student checks the new announcements(numeric:0-100)

13 Discussion groups- how many times the student participate on discussion groups (numeric:0-100)

14 Parent Answering Survey- parent answered the surveys which are provided from school or not (nominal:’Yes’,’No’)

15 Parent School Satisfaction- the Degree of parent satisfaction from school(nominal:’Yes’,’No’)

16 Student Absence Days-the number of absence days for each student (nominal: above-7, under-7)

The students are classified into three numerical intervals based on their total grade/mark:

Low-Level: interval includes values from 0 to 69,

Middle-Level: interval includes values from 70 to 89,

High-Level: interval includes values from 90-100.

Data Handling

  • Importing Data with Pandas: Read in a csv of our data
  • Identify the shape of the dataset: Tells us we have 480 observations/students to analyze
  • data --> Summary of our data contained in a Pandas DataFrame: Preview of features and values

In [1]:
import numpy as np
import pandas as pd


data = pd.read_csv('xAPI-Edu-Data.csv') #columns = ['Gender','Nationality', 'PlaceofBirth','StageID','GradeID','SectionID'
                                              #,'Topic','Semester','Relation','RaisedHands','VisitedResources'
                                              #,'AnnoucementsView','Discussion','ParentAnsweringSurvey',
                                              #'ParentSchoolSatisfaction','StudentAbsenceDays','Class/FinalGrade'])

print (data.shape)

    
data.head(15)


(480, 17)
Out[1]:
gender NationalITy PlaceofBirth StageID GradeID SectionID Topic Semester Relation raisedhands VisITedResources AnnouncementsView Discussion ParentAnsweringSurvey ParentschoolSatisfaction StudentAbsenceDays Class
0 M KW KuwaIT lowerlevel G-04 A IT F Father 15 16 2 20 Yes Good Under-7 M
1 M KW KuwaIT lowerlevel G-04 A IT F Father 20 20 3 25 Yes Good Under-7 M
2 M KW KuwaIT lowerlevel G-04 A IT F Father 10 7 0 30 No Bad Above-7 L
3 M KW KuwaIT lowerlevel G-04 A IT F Father 30 25 5 35 No Bad Above-7 L
4 M KW KuwaIT lowerlevel G-04 A IT F Father 40 50 12 50 No Bad Above-7 M
5 F KW KuwaIT lowerlevel G-04 A IT F Father 42 30 13 70 Yes Bad Above-7 M
6 M KW KuwaIT MiddleSchool G-07 A Math F Father 35 12 0 17 No Bad Above-7 L
7 M KW KuwaIT MiddleSchool G-07 A Math F Father 50 10 15 22 Yes Good Under-7 M
8 F KW KuwaIT MiddleSchool G-07 A Math F Father 12 21 16 50 Yes Good Under-7 M
9 F KW KuwaIT MiddleSchool G-07 B IT F Father 70 80 25 70 Yes Good Under-7 M
10 M KW KuwaIT MiddleSchool G-07 A Math F Father 50 88 30 80 Yes Good Under-7 H
11 M KW KuwaIT MiddleSchool G-07 B Math F Father 19 6 19 12 Yes Good Under-7 M
12 M KW KuwaIT lowerlevel G-04 A IT F Father 5 1 0 11 No Bad Above-7 L
13 M lebanon lebanon MiddleSchool G-08 A Math F Father 20 14 12 19 No Bad Above-7 L
14 F KW KuwaIT MiddleSchool G-08 A Math F Mum 62 70 44 60 No Bad Above-7 H

Step #1 - Exploring/Cleaning the data

Summary of Statistics:

  • Use Pandas' describe function to generate descriptive statistics that summarize the central tendency, dispersion and shape of a dataset’s distribution (for all numeric columns).
  • By looking at info: We have the total number of observations = 480, and we see there are no missing (non-null) values (i.e., we have a clean data set).

In [2]:
data.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 480 entries, 0 to 479
Data columns (total 17 columns):
gender                      480 non-null object
NationalITy                 480 non-null object
PlaceofBirth                480 non-null object
StageID                     480 non-null object
GradeID                     480 non-null object
SectionID                   480 non-null object
Topic                       480 non-null object
Semester                    480 non-null object
Relation                    480 non-null object
raisedhands                 480 non-null int64
VisITedResources            480 non-null int64
AnnouncementsView           480 non-null int64
Discussion                  480 non-null int64
ParentAnsweringSurvey       480 non-null object
ParentschoolSatisfaction    480 non-null object
StudentAbsenceDays          480 non-null object
Class                       480 non-null object
dtypes: int64(4), object(13)
memory usage: 63.8+ KB

In [3]:
data.describe()


Out[3]:
raisedhands VisITedResources AnnouncementsView Discussion
count 480.000000 480.000000 480.000000 480.000000
mean 46.775000 54.797917 37.918750 43.283333
std 30.779223 33.080007 26.611244 27.637735
min 0.000000 0.000000 0.000000 1.000000
25% 15.750000 20.000000 14.000000 20.000000
50% 50.000000 65.000000 33.000000 39.000000
75% 75.000000 84.000000 58.000000 70.000000
max 100.000000 99.000000 98.000000 99.000000

Summary of Statistics [separated by class]: 'High, Middle, Low'

Now that we have an overview for the entire dataset, let's dig deeper and focus on the features we're primarily interested in: the grades the student receive.

  • Using pandas grouby we can split the data into groups based on some criteria (in this case --- the class column 'H, M, L').

  • Using the aggregate function with the groupby, we are able to compute the summary statistics about each group (in this case --- the min, median, mean and max for each separate class 'H, M, L').

We then compare the summary statistics for total students (table above) vs. students for each separate classes 'H, M, L' (table below). The information presented below is more granular & outlines differences in class.

For example: We see that the average number of times students with 'H' raised their hands is around 70.23, whereas for students with 'L' that number drops to around 17. This separation of mean values will become important later when we preprocess the data.


In [4]:
data.groupby('Class').aggregate(['min', np.median, np.mean, max])


Out[4]:
raisedhands VisITedResources AnnouncementsView Discussion
min median mean max min median mean max min median mean max min median mean max
Class
H 10 75 70.288732 100 4 84 78.746479 99 2 52 53.380282 98 2 54 53.661972 99
L 0 10 16.889764 80 0 11 18.322835 90 0 11 15.574803 66 1 21 30.834646 98
M 0 50 48.938389 100 2 72 60.635071 99 0 38 40.962085 93 3 40 43.791469 98

Data Visualization & Exploratory Analysis

  • Libraries used: Matplotlib & Seaborn

First step in exploring the data to classify the grades of students is to look at a simple value count.

Next we visualize the number of students in each of the separate classes to get an idea of how evenly spread the 'H, L, and M's are among the students in the dataset.

As the bar graph shows, distribution is fairly even, no major skews toward any individual class.


In [42]:
import pandas as pd
import matplotlib.pyplot as plt 
import seaborn as sns
sns.set()

data.Class.value_counts().plot(kind='bar')

data.Class.value_counts()


Out[42]:
M    211
H    142
L    127
Name: Class, dtype: int64

Categorical Feature Analysis - Parent Satisfation & Relation [Separated by Class: H, M, L]

Now we want to pair those 'High, Low and Middle'grade value counts with other features that may indicate patterns in the distribution of grades and add value to our analysis.

Below are 4 bar graphs for which grades students received based on Parent satisfaction (good or bad) and Relation responsible for the student (Father or Mum)

There is a clear pattern where:

High grade = Parent schoool satisfaction (good) & Relation responsible (Mum)

Low grade = Parent schoool satisfaction (bad) & Relation responsible (Father)


In [41]:
import matplotlib.pyplot as plt
fig = plt.figure(figsize=(18,4))

plt.subplot(141)
good_sat= data.Class[data.ParentschoolSatisfaction== 'Good'].value_counts()
good_sat.plot(kind='bar')
plt.title('Parent school satisfaction = Good')

plt.subplot(142)
bad_sat = data.Class[data.ParentschoolSatisfaction== 'Bad'].value_counts()
bad_sat.plot(kind='bar')
plt.title('Parent school satisfaction = Bad')

plt.subplot(143)
survey_no = data.Class[data.Relation == 'Father'].value_counts()
survey_no.plot(kind='bar')
plt.title('Father Responsible for Student')

plt.subplot(144)
survey_yes = data.Class[data.Relation == 'Mum'].value_counts()
survey_yes.plot(kind='bar')
plt.title('Mother Responsible for Student')


Out[41]:
<matplotlib.text.Text at 0x11b202588>

Once again, by using the pandas groupby function, we can get a table showing the exact count of the distribution of students in H, L, M --- looking at the first feature "Parent School Satisfaction" only.


In [7]:
grades_count = data.groupby(['ParentschoolSatisfaction','Class'])['Class'].aggregate('count').unstack()

grades_count


Out[7]:
Class H L M
ParentschoolSatisfaction
Bad 24 84 80
Good 118 43 131

Digging a little deeper:

Using pandas pivot_table we can capture more complex insights from the data & further break down 'H, L, M' class structure by including two levels of analysis (both Parent school satisfaction & Relation).

It is common to start with simple analysis with one feature and add complexity with multiple features as we understand how they interact with our target values of interest.


In [8]:
#shows that adding relation only adds noise, not a significant difference between father & mum 

data.pivot_table('raisedhands', index = ['ParentschoolSatisfaction','Relation'], 
                 columns = 'Class', aggfunc = 'mean')


Out[8]:
Class H L M
ParentschoolSatisfaction Relation
Bad Father 73.375000 16.282051 42.793103
Mum 75.687500 16.000000 45.818182
Good Father 65.764706 17.076923 45.240506
Mum 70.797619 19.705882 62.730769

Here is a visualization of the above pivot table. The combination of two features do not vary too much within the 'H, M, L' classes, but there is a distict gap showing that students in 'H' class raised their hands at an average of around 3 times as often as 'L' students.


In [40]:
parents_hands = data.pivot_table('raisedhands', index = ['ParentschoolSatisfaction','Relation'], 
                 columns = 'Class', aggfunc = 'mean')

parents_hands.plot()
plt.ylabel('Average number of times student raised hand')


Out[40]:
<matplotlib.text.Text at 0x11b56eef0>

Categorical Feature Analysis Cont.

Another good feature that may add value to our analysis of factors that contributes to a student's class is their attendance record.

Again, looking at value counts for the separate classes, we see that the number of students who had under-7 absences and received a 'Low' class/grade is very low. (and vice versa for above-7 absences & 'H')


In [39]:
import matplotlib.pyplot as plt

attendance = pd.crosstab(index=data['StudentAbsenceDays'], columns=[data['Class']], normalize='columns')


attendance.plot(kind='bar', figsize=(6,6), stacked=True)


Out[39]:
<matplotlib.axes._subplots.AxesSubplot at 0x11b645278>

Moving on to numerical feature analysis:

Look at the distribution and probability density function of the 4 numerical columns:

1) raised hands

2) Visited Resources

3) Announcements Views

4) Discussion

The graphs for 'raised hands' and 'visited resources' show the most promise in differentiating between 'H' and 'L' classes. The bimodal shape of their density curves show distinct peaks at the opposite ends of the graph.


In [35]:
fig = plt.figure(figsize=(18,8))

plt.subplot(221)
data.raisedhands[data.Class == 'H'].plot(kind='kde') 
data.raisedhands[data.Class == 'M'].plot(kind='kde') 
data.raisedhands[data.Class == 'L'].plot(kind='kde') 
plt.legend(('High', 'Middle','Low'),loc='best') 
plt.title('Raised Hands')


plt.subplot(222)
data.VisITedResources[data.Class == 'H'].plot(kind='kde') 
data.VisITedResources[data.Class == 'M'].plot(kind='kde') 
data.VisITedResources[data.Class == 'L'].plot(kind='kde') 
plt.legend(('High', 'Middle','Low'),loc='best') 
plt.title('Visited Resources')


plt.subplot(223)
data.Discussion[data.Class == 'H'].plot(kind='kde') 
data.Discussion[data.Class == 'M'].plot(kind='kde') 
data.Discussion[data.Class == 'L'].plot(kind='kde') 
plt.legend(('High', 'Middle','Low'),loc='best')
plt.title('Discussion')

plt.subplot(224)
data.AnnouncementsView[data.Class == 'H'].plot(kind='kde') 
data.AnnouncementsView[data.Class == 'M'].plot(kind='kde') 
data.AnnouncementsView[data.Class == 'L'].plot(kind='kde') 
plt.legend(('High', 'Middle','Low'),loc='best')
plt.title('Viewed Announcements')


Out[35]:
<matplotlib.text.Text at 0x11ba99f60>

Numerical Feature Analysis: Relationship & Correlation

Correlations can tell us about the direction, and the degree (strength) of the relationship between two variables (or features)

Using Pearson's correlation coefficient (Pearson's r) where r ranges from -1 (negative relationship) to 0 (no relationship to 1 (positive relationship).

We see that the only relatively strong relationship r = 0.69 is between the two strongest indications of class differences as noted above: 'raised hands' and 'visited resources'.


In [12]:
raised_hands = data['raisedhands']
discussion = data['Discussion']
v_resources = data['VisITedResources']
v_announcements = data['AnnouncementsView']

def correlation(x,y):
    std_x = (x-x.mean())/x.std(ddof=0)
    std_y = (y-y.mean())/y.std(ddof=0)
    
    return (std_x * std_y).mean()

print ('Raised hands & Discussion: ', correlation(raised_hands, discussion))

print ('Visted Resources & Discussion: ', correlation(v_resources, discussion))

print ('Raised hands & Visited Resources: ', correlation(raised_hands, v_resources))


Raised hands & Discussion:  0.3393859910133952
Visted Resources & Discussion:  0.24329176916115017
Raised hands & Visited Resources:  0.6915717054692965

A more detailed view of correlation: Heatmap visualization

The stronger the correlation, the deeper the shade of purple, really just confirming our calcuations previously


In [34]:
fig,ax= plt.subplots(figsize=(9,7))
sns.heatmap(data.corr(),annot=True)


Out[34]:
<matplotlib.axes._subplots.AxesSubplot at 0x11b9aa908>

Scatterplot Visualization: Class ['H, M, L'] according to 'raisedhands' & 'visitedresources'

Zoom in on main features of importance:

To get a basic understanding of what we are trying to predict, we narrow down the analysis to the two main features that have thus far provided the most information in determining 'H, M, L' classes.

First we need a procedure to turn the class values (categorical) into numerical values in order to plot them and show the different points which are 'H'(large green points) or 'M' (medium yellow points) or 'L' (small red points).


In [14]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

def new_grade(grade):
    if grade == 'H':
        return 100
    elif grade == 'M':
        return 50
    elif grade == 'L':
        return 10
    
def new_grades(grades): 
    return grades.apply(new_grade)

print (new_grades(data['Class']).head())

converted_grades = new_grades(data['Class'])



fig = plt.figure(figsize=(12, 9))
plt.scatter(data['VisITedResources'], data['raisedhands'],c= converted_grades, s = converted_grades, cmap = 'RdYlGn')
plt.ylabel("Number of times Student Raised their hand in class")
plt.xlabel("Number of times Student Visited Resources")
plt.title ('Interaction Correlation by Class Marks')
plt.colorbar()


0    50
1    50
2    10
3    10
4    50
Name: Class, dtype: int64
Out[14]:
<matplotlib.colorbar.Colorbar at 0x11aab6a20>

While there is a decent amount of noise, we can still see a linear relationship where students who raise their hands and visit class resources are in the group with the higher class.

There is an evident pattern where there's a cluster of 'H's at the top right hand corner of the graph and a cluster of 'L's at the bottom left hand corner of the graph.


In [15]:
sns.lmplot(x="VisITedResources", y="raisedhands", data=data)
sns.plt.show()


Exploring correlations between multidimensional data & plotting all pairs of values against each other

We see the relationship between class is strongest between 'raisedhands' and other features as well as 'visitedresources' and other features

The 2 weaker features seem to be 'announcements view' and 'discussion' where the points for the different 'H, M, L' classes seem to be scattered all over with no particular pattern.

Now that we have this analysis, we have a better understanding of which features will help us predict the target classes we want and which will be added noise. This is an importance part of the process of feature analysis.


In [16]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline



sns.pairplot(data, hue='Class', size=2.5);


Step #2. - Data Analysis & Application of Machine Learning Classification Models with Performance Evaluation

1. Preprocessing Data

  • Removal of outliers
  • Encoding categorical data

2. Machine Learning Models Used

  • Logistic Regression Classifier
  • Support Vector Machines Classifer
  • Random Forest Classifier (ensemble)
  • Gradient Boosting Classifier (ensemble)
  • Bagging Classifier (ensemble)

3. Evaluation Methods Used

  • Prediction Accuracy Score (on test data)
  • K-fold cross validation (on training data)
  • Plotting learning curves to assess Bias vs. Variance
  • Precision, Recall & F1 Scores

4. Implementation Process

  • Once the independent variables (features) have been determined, we split the data set into testing and training sets using sklearn's cross validation: train_test_split function
  • Next we import the models we need from sklearn
  • Scaling features (using standard scaler) & PCA (if applicable)
  • Tuning hyperparameters to optimize model performance (each model could require different constraints, weights or learning rates to generalize different data patterns)
  • Fit/train the data
  • Predict y-values using test data
  • Use sklearn's metrics to determine accuracy score
  • Use k-fold cross validation to avoid overfitting
  • Get additional performance metrics such as precision, recall and F1 score for comparison

Preprocessing data: remove outliers to reduce noise and improve accuracy

Here, the criteria used for determining outliers is based upon the summary of statistics table above (specifically the one separated by class). Using the mean for 'H' and 'L' students in the first two features, we identify a cutoff point for each. (For example, ff a student with an 'H' raised their hands less than the mean for which a student with 'L' raised their hands--they are removed from the dataset.)

Similarly, when looking at the bar graphs for attendance, we saw that very few 'H' students were absent more than 7 days, and very few 'L' students were absent less than 7 days-- also labeling them as outliers/candidates for removal.


In [17]:
#Creating the criteria for an outlier based on multiple conditions. Only dealing with 'H' and 'L' class to be safe. 

outliers_1 = data[(data['Class'] == 'H') & (data['raisedhands'] <= 17)]
outliers_2 = data[(data['Class'] == 'H') & (data['VisITedResources'] <= 18)]
outliers_3 = data[(data['Class'] == 'L') & (data['raisedhands'] >= 70)]
outliers_4 = data[(data['Class'] == 'L') & (data['VisITedResources'] >= 78)]
outliers_5 = data[(data['Class'] == 'H') & (data['StudentAbsenceDays'] == 'Above-7')]
outliers_6 = data[(data['Class'] == 'L') & (data['StudentAbsenceDays'] == 'Under-7')]



#dropping the rows which contained the outliers as indicated by the above criteria

new_data = data.drop([14,47,48,72,74,80,84,86,87,88,94,96,124,128,129,190,200,205,226,227,
                      228,248,250,255,344,345,444,445,450])

#Using shape to check the number of outliers dropped, usually no more than 10%.
#However, since the dataset is small, better to leave more data to work with. 
#451 left out of the orginal 480 observations: We only dropped ~ 6% of data. 

print(new_data.shape)


(451, 17)

Preprocessing Data: Label Encoding Categorical Values to transform the prediction target (y)

Libraries used: Scikit-Learn

In Sklearn, machine learning algorithms require the input variables from the data to be real values. Therefore we must format (or transform) the data into the structure that allows us to feed it into the model.

Here, we use a label encoder to prepare the data where numerical dimensions are used to represent membership in the categories ('H, L, M')

As we can see, once encoded:

Categorical Class Numerical value
H (High grade) 0
L (Low grade) 1
M (Middle grade) 2

In [18]:
#Preprocessing data to encode categorical values for the y-target column


from sklearn import preprocessing
le = preprocessing.LabelEncoder()

target_value = le.fit_transform(new_data.Class)

print (target_value)


[2 2 1 1 2 2 1 2 2 2 0 2 1 1 2 2 2 2 0 2 2 2 1 1 1 2 1 2 2 0 1 1 1 1 1 1 2
 1 2 1 2 1 2 2 1 1 2 1 1 2 0 1 1 1 1 2 2 1 2 0 2 1 1 2 0 0 2 1 2 2 2 2 2 1
 0 1 1 2 1 1 1 0 0 0 0 2 2 2 2 0 1 1 2 1 2 0 2 2 0 2 1 1 1 1 2 0 2 2 2 1 2
 2 1 2 1 1 2 1 1 0 0 0 2 0 2 1 1 2 0 1 2 0 2 2 0 0 2 0 1 2 0 2 2 1 2 0 2 0
 2 2 0 2 0 0 2 0 2 1 1 2 1 0 2 0 2 0 1 0 2 1 0 2 2 0 2 1 2 2 2 2 0 0 1 2 0
 2 2 1 2 2 2 2 0 2 0 1 1 1 2 2 0 2 2 2 2 0 0 2 1 2 1 2 2 2 1 1 2 2 0 0 2 1
 2 0 2 0 2 2 1 2 1 0 0 2 2 1 1 2 2 2 2 0 2 2 2 2 0 2 2 0 0 0 0 0 2 2 0 0 0
 0 2 2 0 0 2 2 1 1 0 0 2 2 0 0 2 2 1 1 2 2 2 2 0 0 2 2 2 2 0 0 0 0 0 0 0 0
 2 2 1 1 2 2 1 1 2 2 1 1 2 2 1 1 2 2 2 2 2 2 2 2 0 0 1 1 1 1 2 2 0 0 2 2 0
 0 2 2 0 0 0 0 2 2 0 0 2 2 1 1 1 1 2 2 1 1 1 1 0 0 0 0 2 2 1 1 2 2 0 0 0 0
 2 2 0 0 2 2 0 0 0 0 1 1 2 2 0 0 2 2 1 1 0 0 0 0 0 0 2 2 0 0 2 2 1 1 0 0 2
 2 2 2 2 2 2 2 2 2 2 2 2 2 0 0 0 2 2 2 2 0 0 0 0 2 2 2 2 0 0 0 0 1 1 2 2 2
 2 1 1 2 2 1 1]

Preprocessing Data: Dummy/One-hot Encoding to transform the features values (x)

A common alternative approach to encoding categorical values is called dummy (or one-hot) encoding where the the basic strategy is to convert each category value into a new column and assigns a 1 or 0 (True/False) value to the column

Pandas supports this feature using get_dummies to create dummy/indicator variables (aka 1 or 0).


In [19]:
data_dummies = pd.get_dummies(new_data, columns = ['ParentschoolSatisfaction', 'StudentAbsenceDays','Relation'])

print(data_dummies.shape)

data_dummies.head()


(451, 20)
Out[19]:
gender NationalITy PlaceofBirth StageID GradeID SectionID Topic Semester raisedhands VisITedResources AnnouncementsView Discussion ParentAnsweringSurvey Class ParentschoolSatisfaction_Bad ParentschoolSatisfaction_Good StudentAbsenceDays_Above-7 StudentAbsenceDays_Under-7 Relation_Father Relation_Mum
0 M KW KuwaIT lowerlevel G-04 A IT F 15 16 2 20 Yes M 0 1 0 1 1 0
1 M KW KuwaIT lowerlevel G-04 A IT F 20 20 3 25 Yes M 0 1 0 1 1 0
2 M KW KuwaIT lowerlevel G-04 A IT F 10 7 0 30 No L 1 0 1 0 1 0
3 M KW KuwaIT lowerlevel G-04 A IT F 30 25 5 35 No L 1 0 1 0 1 0
4 M KW KuwaIT lowerlevel G-04 A IT F 40 50 12 50 No M 1 0 1 0 1 0

Logistic Regression Classifer

The logistic regression classifier is a method used to generalizes logistic regression to multiclass problems (applicable here since we are trying to predict more than two possible discrete outcomes). It is a model that is used to predict the probabilities of the different possible outcomes of a categorically distributed dependent variable

Dependent variable (y values) = Class ('H, L, M')

Independent variables (or features / x values) = 'raisedhands', 'VisITedResources', 'ParentschoolSatisfaction_Bad', 'Relation_Father', Relation_Mum','ParentschoolSatisfaction_Good', 'StudentAbsenceDays_Above-7','StudentAbsenceDays_Under-7'

In Logistic Regression modeling, input values (X) are combined linearly using weights or coefficient values to predict an output value (y)--- or more specifically, the probability that an input (X) belongs to the default class.

Pros:

  • Low variance
  • Provides probabilities for outcomes
  • works well with diagonal (feature) decision boundaries

Cons:

  • Doesn’t perform well when feature space is too large
  • High bias
  • Relies on entire data

In [30]:
feature_cols = feature_cols = ['raisedhands', 'VisITedResources',  'ParentschoolSatisfaction_Bad',  'Relation_Father', 
                               'Relation_Mum','ParentschoolSatisfaction_Good', 'StudentAbsenceDays_Above-7',
                                'StudentAbsenceDays_Under-7']

X = data_dummies[feature_cols]
y = target_value


from sklearn.cross_validation import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.15)


import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

pipe_lr = Pipeline([('rs', StandardScaler()), ('pca', PCA(n_components = 4)), 
                    ('logreg', LogisticRegression(C=1e9))])

pipe_lr.fit(X_train, y_train)

y_pred = pipe_lr.predict(X_test)

from sklearn.metrics import accuracy_score
print ('Prediction Accuracy:', accuracy_score(y_test, y_pred))


from sklearn.cross_validation import cross_val_score
scores = cross_val_score(estimator = pipe_lr, X= X_train, y = y_train, cv = 10, n_jobs =1)
print ('Cross-validated Scores: %s' %scores)
print("CV Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))

from sklearn.metrics import classification_report
print(' ', classification_report(y_test, y_pred))


Prediction Accuracy: 0.764705882353
Cross-validated Scores: [ 0.675       0.74358974  0.76923077  0.73684211  0.81578947  0.73684211
  0.84210526  0.89473684  0.76315789  0.72972973]
CV Accuracy: 0.77 (+/- 0.12)
               precision    recall  f1-score   support

          0       0.61      0.65      0.63        17
          1       0.88      1.00      0.94        23
          2       0.75      0.64      0.69        28

avg / total       0.76      0.76      0.76        68

Visualization of Learning Curves: Test & Training Data

A learning curve shows the validation and training score of an estimator for varying numbers of training samples. It is a tool to find out how much we benefit from adding more training data and whether the estimator suffers more from a variance error or a bias error.

If both the validation score and the training score converge to a value that is too low with increasing size of the training set, we will not benefit much from more training data.

We can use the sklearn learning_curve function to generate the values that are required to plot such a learning curve (number of samples that have been used, the average scores on the training sets and the average scores on the validation sets)

Learning curve allows us to verify when a model has learning as much as it can about the data, indicated by: a) the performances on the training and testing sets reach a plateau and b) here is a consistent gap between the two error rates, as is consistent with our graph below.

Our Learning Curves show a decent model because a) the testing and training learning curves converge at similar values and b) the smaller the gap between curves, the better our model generalizes. The results of our graph represent moderate bias and low variance, which is an indication that we should increase model complexity--leading us to the ensemble methods we'll be using moving forward.


In [22]:
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
from sklearn.model_selection import learning_curve
from sklearn.model_selection import ShuffleSplit

def plot_learning_curve(estimator, X, y, ylim=None, cv=None,
                        n_jobs=1, train_sizes=np.linspace(.1, 1.0, 5)):
    plt.figure()
    plt.xlabel("Training examples")
    plt.ylabel("Score")
    train_sizes, train_scores, test_scores = learning_curve(
        estimator, X, y, cv=cv, n_jobs=n_jobs, train_sizes=train_sizes)
    train_scores_mean = np.mean(train_scores, axis=1)
    train_scores_std = np.std(train_scores, axis=1)
    test_scores_mean = np.mean(test_scores, axis=1)
    test_scores_std = np.std(test_scores, axis=1)
    plt.grid()

    plt.fill_between(train_sizes, train_scores_mean - train_scores_std,
                     train_scores_mean + train_scores_std, alpha=0.1,
                     color="r")
    plt.fill_between(train_sizes, test_scores_mean - test_scores_std,
                     test_scores_mean + test_scores_std, alpha=0.1, color="g")
    plt.plot(train_sizes, train_scores_mean, 'o-', color="r",
             label="Training score")
    plt.plot(train_sizes, test_scores_mean, 'o-', color="g",
             label="Testing score")

    plt.legend(loc="best")
    return plt

cv = ShuffleSplit(n_splits=10, test_size=0.2, random_state=0)
estimator = svc
plot_learning_curve(estimator, X, y, (0.7, 1.01), cv=cv, n_jobs=4)


Out[22]:
<module 'matplotlib.pyplot' from '/Users/jadeshao/anaconda/lib/python3.6/site-packages/matplotlib/pyplot.py'>

Random Forest Classifier

A random forest is an ensemble method (combination of learning algorithms) that fits a number of decision tree classifiers on various sub-samples of the dataset and use averaging (majority of votes to make a prediction) to improve the predictive accuracy and control over-fitting.

At the root of random forest are decision trees, which are a type of flowchart which assist in the decision making process. Internal nodes represent tests on particular attributes, while branches exiting nodes represent a single test outcome, and leaf nodes represent class labels. The goal is to split on the attributes which create the purest child nodes possible.

Random forests are a way of averaging multiple deep decision trees, trained on different parts of the same training set, with the goal of reducing the variance. This comes at the expense of a small increase in the bias and some loss of interpretability, but generally greatly boosts the performance in the final model.

Pros:

  • Accurate and does not tend to overfit
  • Robust against outliers in the predictive variables
  • It gives estimates of what variables are important in the classification--> feature selection

Cons:

  • Not as easy to visually interpret
  • Slower runtime

In [33]:
# RANDOM FOREST CLASSIFIER model -- ensemble method No.1

feature_cols = ['raisedhands', 'VisITedResources', 'ParentschoolSatisfaction_Bad',  
                'ParentschoolSatisfaction_Good', 'StudentAbsenceDays_Above-7', 'Relation_Father', 
                               'Relation_Mum',
                'StudentAbsenceDays_Under-7']

X = data_dummies[feature_cols]
y = target_value


# train/test split
from sklearn.cross_validation import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.15)

from sklearn.preprocessing import StandardScaler
scl = StandardScaler()
scl.fit_transform(X_train, y_train)
scl.transform(X_test)


from sklearn.ensemble import RandomForestClassifier
rfc = RandomForestClassifier(max_depth=3, n_estimators=200, oob_score = True, n_jobs = -1, 
                                                   random_state=50)
rfc.fit(X_train, y_train)


y_pred = rfc.predict(X_test)
from sklearn.metrics import accuracy_score
print ('Prediction Accuracy:', accuracy_score(y_test, y_pred))


from sklearn.cross_validation import cross_val_score
scores = cross_val_score(estimator = pipe_lr, X= X_train, y = y_train, cv = 10, n_jobs =1)
print ('Cross-validated Scores: %s' %scores)
print("CV Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))

from sklearn.metrics import classification_report
print('Scores', classification_report(y_test, y_pred))


Prediction Accuracy: 0.838235294118
Cross-validated Scores: [ 0.725       0.84615385  0.66666667  0.79487179  0.71052632  0.84210526
  0.68421053  0.68421053  0.78378378  0.83783784]
CV Accuracy: 0.76 (+/- 0.14)
Scores              precision    recall  f1-score   support

          0       0.86      0.79      0.83        24
          1       0.87      0.93      0.90        14
          2       0.81      0.83      0.82        30

avg / total       0.84      0.84      0.84        68

Random Forest: Feature Importance

One of the best use cases for random forest is that it's a great tool for feature selection. Since random forest is built on having multiple decision tress, one of the byproducts of trying lots of decision tree variations is that you can examine which variables are working best/worst in each tree.

Random forests measures feature importance through something called the Gini Importance or Mean Decrease in Impurity (MDI) calculates each feature importance as the sum over the number of splits (accross all tress) that include the feature, proportionaly to the number of samples it splits.


In [24]:
# Taking a look at feature importance via Random Forest

import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

feat_imp = pd.Series(rfc.feature_importances_, index=X.columns)
feat_imp.sort_values(inplace=True, ascending=False)
feat_imp.head(20).plot(kind='barh', title='Feature importance')


Out[24]:
<matplotlib.axes._subplots.AxesSubplot at 0x11c167eb8>

Gradient Boosting Classifier

Gradient Boosting is based on the the idea of boosting, which is a method of trying to modify a weak learner into a becoming a better one. It starts with filtering observations, leaving those observations that the weak learner can handle and focusing on developing new weak learns to handle the remaining difficult observations. For example, the model will build trees one at a time, where each new tree helps to correct errors made by previously trained tree.

With Gradient Boosting, the objective is to minimize the loss of the model by adding weak learners using a gradient descent like procedure. This type of algorithm can be described as a stage-wise additive model. This is because one new weak learner is added at a time and existing weak learners in the model are frozen and left unchanged.

Pros:

  • Can easily handle qualitative (categorical) features
  • Very powerful and performs well in most cases

Cons:

  • Training generally takes longer because of the fact that trees are built sequentially
  • Harder to fit/tune parameters

In [27]:
# Gradient Boosting Classifer: Ensemble method No.2

feature_cols = ['raisedhands', 'VisITedResources', 'ParentschoolSatisfaction_Bad',  
                'ParentschoolSatisfaction_Good', 'StudentAbsenceDays_Above-7', 'Relation_Father', 
                               'Relation_Mum',
                'StudentAbsenceDays_Under-7']

X = data_dummies[feature_cols]
y = target_value


# train/test split
from sklearn.cross_validation import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.15)

from sklearn.preprocessing import StandardScaler
scl = StandardScaler()
scl.fit_transform(X_train, y_train)
scl.transform(X_test)

from sklearn.ensemble import GradientBoostingClassifier

gbc = GradientBoostingClassifier(learning_rate = 0.15, n_estimators = 300, max_depth = 8, min_samples_leaf = 3, 
                                 max_features = 'log2')
gbc.fit(X_train,y_train)

y_pred = gbc.predict(X_test)

from sklearn.metrics import accuracy_score
print ('Prediction Accuracy:', accuracy_score(y_test, y_pred))


from sklearn.cross_validation import cross_val_score
scores = cross_val_score(estimator = pipe_lr, X= X_train, y = y_train, cv = 10, n_jobs =1)
print ('Cross-validated Scores: %s' %scores)
print("CV Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))

from sklearn.metrics import classification_report
print('Scores', classification_report(y_test, y_pred))


Prediction Accuracy: 0.867647058824
Cross-validated Scores: [ 0.8974359   0.71794872  0.76923077  0.84615385  0.65789474  0.78947368
  0.73684211  0.65789474  0.73684211  0.81081081]
CV Accuracy: 0.76 (+/- 0.15)
Scores              precision    recall  f1-score   support

          0       0.77      0.89      0.83        19
          1       0.94      0.94      0.94        18
          2       0.89      0.81      0.85        31

avg / total       0.87      0.87      0.87        68

Summary of Scores & Conclusion:

Although the majority of prediction accuracy and cross-validated scores were only in the [upper 70 - low 80] percentile range, with the best model: Gradient Boosting doing slightly better than the rest at 87% prediction accuracy. Overall, it can be said that the models performed well in a dataset where the signal to noise ratio was relatively low and little to none feature engineering was done.

However, given the nature of the project, which was to determine which students were in need of intervention, we can see that the results of the model were very favorable to our cause. In such a case, we often care more about the correct classification of 'L' students who are the ones we want to look out for, and much less about the labeling of 'H' and 'M' students.

For example: An 'H' student being labeled as an 'L' student is much less of a concern---as teachers would simply have spent a little extra time verifying that the student wasn't actually an at-risk student.

Whereas an 'L' student being labeled as an ['H' or 'M'] student would be a much bigger problem---as teachers would have skipped over them entirley and that student would have missed a chance for intervention.

Therefore, looking at the precision and recall scores for label-encoded-target-value 1 (representing 'L' students), we see the scores for correctly identifying 'L' or 1 students at a much higher percentage that the prediction accuracy/cross-validated scores of the dataset as a whole.

The precision is the ratio tp / (tp + fp) where tp is the number of true positives and fp the number of false positives. The precision is intuitively the ability of the classifier not to label as positive a sample that is negative.

The recall is the ratio tp / (tp + fn) where tp is the number of true positives and fn the number of false negatives. The recall is intuitively the ability of the classifier to find all the positive samples.

Precision & Recall scores for row 1 ('L' Students) only:

Model Used Accuracy Cross-Validated Precision Recall
Logistic Regression 0.76 0.77 (+/- 0.12 0.88 1.00
Random Forest 0.83 0.76 (+/- 0.14) 0.87 0.93
Gradient Boosting 0.87 0.76 (+/- 0.15) 0.94 0.94

Conclusion

Having implented several models to observe which was the best fit for our dataset, results show that the top performing model was:

Gradient Boosting Classifier with an Accuracy Score of: 87%

In evaluating the overall performance of this project, and taking into consideration the original aim in which the goal was to have the classifier find all the positive samples of 'L' students for intervention identification, it's evident that through examining the high 'Recall' scores (ranging from .92 to 1.00), our model did exceptionally well in this area.


In [ ]: